To determine which NBA players are promising acquisitions for the team, meaning they are statistically high performing yet underpaid, a K-means clustering model is appropriate. To ensure that the clustering algorithm focuses on the features most indicative of both salary and performance, the first step is to investigate which player statistics correlate most strongly with salary. These features will be used to build the K-means clustering model.
In building the model, it is first crucial to calculate and visualize the optimal number of clusters for describing the players. With that number of centers fixed, graphs and visualizations will then be used to investigate how each cluster characterizes a player profile.
Plotting the clusters against both salary and a particularly indicative performance metric will then identify players who would be valuable additions to the team and who are underpaid enough that they are likely to move to our team when offered a higher salary.
##    2020-21        Age          G         GS         MP         FG        FGA
## 1.00000000 0.39028405 0.15046768 0.53757945 0.46486266 0.58107265 0.57338021
##        FG%         3P        3PA        3P%         2P        2PA        2P%
## 0.10484552 0.42788671 0.42413825 0.11162594 0.53467499 0.54463327 0.03277157
##       eFG%         FT        FTA        FT%        ORB        DRB        TRB
## 0.10428497 0.56782049 0.55506809 0.19096316 0.19927172 0.46545409 0.41729829
##        AST        STL        BLK        TOV         PF        PTS
## 0.59041905 0.44600283 0.22037462 0.57616641 0.28299940 0.59414446
This output contains the correlation of each variable in the data set with salary (the 2020-21 column, whose value of 1 is salary correlated with itself). To keep only the features most explanatory of salary, the data used in the model will comprise only features whose correlation with salary is at least 0.55. PTS (points), AST (assists), TOV (turnovers), FTA (free throws attempted), FGA (field goals attempted), FT (free throws made), and FG (field goals made) will therefore be used to cluster the players. The first six rows of this data, normalized and ready for model fitting, are shown below:
##          PTS        AST        TOV        FTA        FGA         FT         FG
## 1 0.24881292 0.23188406 0.35570470 0.22038567 0.27879581 0.16442953 0.24010554
## 2 0.24406458 0.18260870 0.19463087 0.11019284 0.33507853 0.09731544 0.25065963
## 3 0.07122507 0.02028986 0.07382550 0.03305785 0.08246073 0.02684564 0.06596306
## 4 0.11490978 0.04347826 0.08724832 0.06887052 0.11780105 0.06711409 0.11609499
## 5 0.32003799 0.24347826 0.18120805 0.04958678 0.40837696 0.04697987 0.36411609
## 6 0.04463438 0.04347826 0.06711409 0.02479339 0.05628272 0.02348993 0.04485488
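The feature selection and normalization described above could be sketched roughly as follows. This is a minimal illustration on made-up data, not the code used for this report: the `players` table, the `Salary` column name, and the `min_max` helper are hypothetical, and min-max scaling is an assumption based on the normalized values lying between 0 and 1.

```r
# Toy data standing in for the real player table; all values are made up.
set.seed(1)
n <- 200
players <- data.frame(
  PTS = runif(n, 0, 2000),
  AST = runif(n, 0, 600),
  BLK = runif(n, 0, 100)
)
# Salary is constructed to track PTS and AST but not BLK.
players$Salary <- 5000 * players$PTS + 17000 * players$AST + rnorm(n, sd = 1e5)

# Correlation of every column with salary.
salary_cor <- cor(players)[, "Salary"]

# Keep features at or above the cutoff, excluding salary itself.
cutoff <- 0.55
keep <- setdiff(names(salary_cor)[salary_cor >= cutoff], "Salary")

# Min-max normalization (assumed): rescale each kept feature to [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
clust_data <- as.data.frame(lapply(players[keep], min_max))

head(clust_data)
```

Rescaling matters here because K-means relies on Euclidean distance, so a feature measured on a larger scale would otherwise dominate the clustering.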
The elbow curve suggests that three is the optimal number of clusters.
The NbClust method recommends 2 and 3 clusters equally.
Thus, the model will be run with both two and three clusters, and the better-performing model will be used.
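For readers unfamiliar with the elbow method, the idea can be sketched as below. The data is a made-up set of three well-separated groups, not the player data, and the range of k and the `nstart` value are arbitrary choices for illustration.

```r
set.seed(17)
# Toy data: three well-separated 2-D blobs (invented for illustration).
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which wss stops falling sharply.
# Plotting wss against k (for example with type = "b") draws the curve.
print(data.frame(k = 1:8, wss = round(wss, 1)))
```

On data like this, the drop in within-cluster sum of squares is steep up to k = 3 and small afterwards, which is the pattern the elbow curve is read for.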
# Model with 2 clusters
set.seed(17)
kmeans_2 = kmeans(clust_data, centers = 2, algorithm = "Lloyd")
# Evaluate the quality of the clustering.
# "betweenss" is the between-cluster sum of squares.
betweenss_2 = kmeans_2$betweenss
# "totss" is the total sum of squares: the sum of squared distances from every point to the overall mean of the data set.
totss_2 = kmeans_2$totss
# Proportion of the total variance accounted for by the clusters.
(var_exp_2 = betweenss_2 / totss_2)
## [1] 0.5933796
The percentage of variation that is explained by the model with two centers is 59%.
# Model with 3 clusters
set.seed(17)
kmeans_3 = kmeans(clust_data, centers = 3, algorithm = "Lloyd")
# Evaluate the quality of the clustering.
# "betweenss" is the between-cluster sum of squares.
betweenss_3 = kmeans_3$betweenss
# "totss" is the total sum of squares: the sum of squared distances from every point to the overall mean of the data set.
totss_3 = kmeans_3$totss
# Proportion of the total variance accounted for by the clusters.
(var_exp_3 = betweenss_3 / totss_3)
## [1] 0.7696085
The percentage of variation that is explained by the model with three centers is 77%.
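The model built with three centers has an explained variance of 0.77 (77%), compared to the model with two centers, which has a much lower explained variance of 0.59 (59%). Explained variance always rises as clusters are added, so this comparison alone would not settle the choice, but it agrees with the elbow curve and with one of the two NbClust recommendations. Three clusters is therefore the optimal choice for this data.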
The above graph shows player salary plotted against player points, colored by cluster. My rationale for visualizing the clusters derived from the subsetted data on these axes is that points (PTS) is the statistic most indicative of a player's performance and also one of the features most highly correlated with salary. Thus, graphing salary against points scored allows us to identify players who are high performing yet underpaid.
The clusters can be interpreted as groups of NBA players with differing skill levels and, to a moderate degree, differing salaries. The blue group (cluster 2) represents players with the lowest point totals, who are paid within a narrow range of low salaries. The red group (cluster 1) represents players with moderate or average point totals relative to the entire population, who are paid a moderately varied range of salaries. The green group (cluster 3) represents players with the highest point totals, yet it has the greatest disparity in salary among its members. In other words, players in cluster 3 perform at roughly equal levels, yet the group has the widest salary range and contains the highest salary in the entire population.
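A cluster profile like the one described above can also be summarized numerically rather than read off the graph. The sketch below uses an invented six-row table; the column names `cluster`, `PTS`, and `Salary` and every value in it are hypothetical.

```r
# Hypothetical toy table: cluster labels, points, and salary (all made up).
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  PTS     = c(800, 900, 100, 200, 1800, 1900),
  Salary  = c(5e6, 9e6, 1e6, 2e6, 8e6, 40e6)
)

# Mean points and salary spread (max minus min) within each cluster.
profile <- data.frame(
  cluster      = sort(unique(toy$cluster)),
  mean_pts     = as.numeric(tapply(toy$PTS, toy$cluster, mean)),
  salary_range = as.numeric(tapply(toy$Salary, toy$cluster,
                                   function(s) diff(range(s))))
)
print(profile)
```

In this toy table, the cluster with the highest mean points is also the one with the widest salary range, mirroring the pattern described for cluster 3.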
In determining which players offer the greatest potential payoff to the team, it is important to look at cluster 3 alone, since it is the highest-performing subset of the population. Within cluster 3, the players most likely to be tempted to join our team are those paid at the low end of the cluster's salary range. Thus, I suggest that Zion Williamson, Luka Dončić, and Trae Young be priority recruits for our team. These players have some of the highest point totals in their group, yet are paid the lowest salaries for their level of performance. Williamson, Dončić, and Young are comparable to, and in some cases outperform, other players in cluster 3 who are paid almost four times as much. Moreover, shockingly, these players have about 57 times the number of points of other players with similar or equal salaries. Thus, from the model and the accompanying visualizations, I conclude that Williamson, Dončić, and Young, as well as players with similar profiles, are the most valuable acquisitions for our team.
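The screening logic above, which was applied here by reading the graph, could be expressed in code along these lines. The player labels, values, and the below-median salary threshold are all assumptions made for illustration; the report itself does not define a numeric threshold.

```r
# Hypothetical toy table; player labels and numbers are invented.
toy <- data.frame(
  Player  = c("A", "B", "C", "D", "E", "F"),
  cluster = c(3, 3, 3, 3, 1, 2),
  PTS     = c(1900, 1850, 1800, 1750, 900, 150),
  Salary  = c(40e6, 8e6, 35e6, 6e6, 7e6, 1e6),
  stringsAsFactors = FALSE
)

# Restrict to the highest-scoring cluster (largest mean PTS).
top_cluster <- names(which.max(tapply(toy$PTS, toy$cluster, mean)))
top <- toy[toy$cluster == as.numeric(top_cluster), ]

# Assumed rule: flag players paid below the cluster's median salary,
# listed from lowest salary upward.
targets <- top[top$Salary < median(top$Salary), ]
targets <- targets[order(targets$Salary), ]
print(targets$Player)
```

A rule like this makes the shortlist reproducible, although the choice of threshold remains a judgment call.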